Imagine that we are some fancy data scientists exploring - once again - the Gapminder data. We are particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS suggested that we use this tidyverse thing to complete our tasks. They also told us that we do not always need to load all of its packages at once.
tidyverse that are needed for importing Excel data and data wrangling.
library(readxl)
library(dplyr)
library(tidyr)
Ok, that wasn’t too hard. But data science is about data, so we have to load the data we are interested in.
sheet = "name_of_your_sheet"
gap_gdp <-
readxl::read_excel(
path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
sheet = "Data"
)
# OR: alternatively you could also import the Gapminder GDP data from the csv file
# gap_gdp <-
# readr::read_csv("../data/gapminder/gdppercapita_us_inflation_adjusted.csv")
Have the data been successfully imported? They should be in a tibble with the dimensions 275 x 53. As a further check: The income per person for Algeria for the years 1960, 1961, and 1962 should be 1280, 1085, and 856.
select() and filtering rows by number with slice().
gap_gdp %>%
select(1:4) %>%
slice(n = 5)
## # A tibble: 1 x 4
## `Income per person (fixed 2000 US$)` `1960.0` `1961` `1962`
## <chr> <dbl> <dbl> <dbl>
## 1 Algeria 1280. 1085. 856.
Let’s say that we are interested in the earliest 10 years as well as the most recent 10 years that appear in the dataset. If we want to aggregate the data per year, they should ideally be in long format.
gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() (GDP might be a good choice here) and change the variable type for year to integer.
gap_gdp %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
mutate(year = as.integer(year))
## # A tibble: 14,300 x 3
## country year GDP
## <chr> <int> <dbl>
## 1 Abkhazia 1960 NA
## 2 Afghanistan 1960 NA
## 3 Akrotiri and Dhekelia 1960 NA
## 4 Albania 1960 NA
## 5 Algeria 1960 1280.
## 6 American Samoa 1960 NA
## 7 Andorra 1960 NA
## 8 Angola 1960 NA
## 9 Anguilla 1960 NA
## 10 Antigua and Barbuda 1960 NA
## # ... with 14,290 more rows
There are still a lot of missing values we might want to get rid of, and the data are not arranged in a way that is ideal to explore changes over time. For the next tasks, simply re-use the previous code and add the following commands with %>%.
filter() in combination with !is.na.
gap_gdp %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
mutate(year = as.integer(year)) %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP)
## # A tibble: 7,988 x 3
## country year GDP
## <chr> <int> <dbl>
## 1 Malawi 1960 98.6
## 2 China 1960 105.
## 3 Burundi 1960 116.
## 4 Burkina Faso 1960 122.
## 5 Lesotho 1960 132.
## 6 Nepal 1960 139.
## 7 Togo 1960 177.
## 8 India 1960 181.
## 9 Pakistan 1960 187.
## 10 Indonesia 1960 201.
## # ... with 7,978 more rows
Now we have a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we have the values for all years between these two periods in the data. However, we might want to use some of these data points in future analyses. Hence, we will do the following analyses ‘on the fly’ (i.e., without creating a new dataset). Let’s start with the first period.
integer, you can simply filter the range of years you are interested in. The first year in the dataset is 1960.
gap_gdp %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
mutate(year = as.integer(year)) %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP) %>%
filter(between(year, 1960, 1969)) %>%
group_by(year) %>%
summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
## year GDP_over_all_countries
## <int> <dbl>
## 1 1960 2863.
## 2 1961 2936.
## 3 1962 3034.
## 4 1963 3127.
## 5 1964 3297.
## 6 1965 3479.
## 7 1966 3562.
## 8 1967 3644.
## 9 1968 3827.
## 10 1969 4015.
Now it should be easy do the same for the 10 most recent years in the dataset…
gap_gdp %>%
rename(country = `Income per person (fixed 2000 US$)`) %>%
gather(-country, key = "year", value = "GDP") %>%
mutate(year = as.integer(year)) %>%
filter(!is.na(GDP)) %>%
arrange(year, GDP) %>%
filter(year >= 2002) %>%
group_by(year) %>%
summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
## year GDP_over_all_countries
## <int> <dbl>
## 1 2002 7983.
## 2 2003 8113.
## 3 2004 8335.
## 4 2005 8545.
## 5 2006 8899.
## 6 2007 9219.
## 7 2008 8999.
## 8 2009 8463.
## 9 2010 7700.
## 10 2011 7603.